feat: add metric pgrst_jwt_cache_size in admin server #3802
base: main
Conversation
Don't forget the feedback about the actual size in bytes (#3801 (comment)). Ultimately, I believe we're going to need an LRU cache for the JWT cache to be production ready, so having this metric in bytes will be useful now and later.
Force-pushed from 1c8fef6 to 9a23b43.
Calculating the "actual" byte size of the cache is tricky (Haskell laziness is a lot to deal with sometimes) and I still haven't figured it out yet. In the meantime, I have written some code to approximate the cache size in bytes. It works as follows:
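As a rough illustration of the approximation idea (this is a sketch, not the PR's actual code; `approxEntrySize` and the overhead constant are hypothetical), an entry's size can be estimated from the byte lengths of its key and value plus a fixed per-entry overhead for the cache's internal bookkeeping:

```haskell
import qualified Data.ByteString.Char8 as BS

-- Assumed constant overhead per cache entry (HashMap node, TimeSpec, ...);
-- this is a guess for illustration, not a measured figure.
entryOverheadBytes :: Int
entryOverheadBytes = 64

-- Approximate a JWT cache entry's size as token bytes + claims bytes + overhead.
approxEntrySize :: BS.ByteString -> BS.ByteString -> Int
approxEntrySize token claims =
  BS.length token + BS.length claims + entryOverheadBytes

main :: IO ()
main = print (approxEntrySize (BS.pack "header.payload.sig")
                              (BS.pack "{\"role\":\"author\"}"))
```

The point of such an estimate is that it is O(entry) at insert time, instead of walking the whole heap structure on every metrics scrape.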
Force-pushed from 1254ed5 to 11727ab.
This is pretty cool. I do see the size starting at 0 and then increasing as I do requests:

```
# $ nix-shell
# $ PGRST_ADMIN_SERVER_PORT=3001 PGRST_JWT_CACHE_MAX_LIFETIME=30000 postgrest-with-postgresql-16 -f test/spec/fixtures/load.sql postgrest-run
$ curl localhost:3001/metrics
# HELP pgrst_jwt_cache_size The number of cached JWTs
# TYPE pgrst_jwt_cache_size gauge
pgrst_jwt_cache_size 0.0
$ curl localhost:3000/authors_only -H "Authorization: Bearer $(postgrest-gen-jwt --exp 10 postgrest_test_author)"
[]
$ curl localhost:3001/metrics
..
pgrst_jwt_cache_size 72.0
$ curl localhost:3000/authors_only -H "Authorization: Bearer $(postgrest-gen-jwt --exp 10 postgrest_test_author)"
[]
$ curl localhost:3001/metrics
..
pgrst_jwt_cache_size 144.0
$ curl localhost:3000/authors_only -H "Authorization: Bearer $(postgrest-gen-jwt --exp 10 postgrest_test_author)"
[]
$ curl localhost:3001/metrics
..
pgrst_jwt_cache_size 216.0
```

Of course this doesn't drop back down after a while, because we need #3801 for that. One issue I've noticed is that we're printing empty log lines for each request:
This is due to the addition of the new Observation (postgrest/src/PostgREST/Observation.hs, line 60 in 9a23b43) and the following line here (postgrest/src/PostgREST/Observation.hs, line 146 in 9a23b43).
This is surprising behavior; I'll try to refactor it in a new PR. For now, how about printing a message like:
This should only happen for a log level greater than debug; check how this is done in postgrest/src/PostgREST/Logger.hs, lines 79 to 96 in 9a23b43.
The above is understandable. What we really need is a good-enough approximation, so we have an order-of-magnitude understanding of whether the cache size is in KB, MB or GB. So far this seems enough. We should definitely document why we use an approximation, though.
I've just noticed that perf drops badly on this PR:
https://github.com/PostgREST/postgrest/pull/3802/checks?check_run_id=34517395528
The recursiveSize function says:
@taimoorzaeem What could we do to avoid this drop? Maybe calculate the cache size periodically on a background thread? Any thoughts?
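One possible shape for the background-thread idea (a sketch under assumed names, not PostgREST code): sample the cache on a timer and publish the result through an IORef that the metrics endpoint reads, so the hot request path never pays for the expensive size calculation:

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Hypothetical sketch: recompute an expensive cache-size figure on a timer
-- and publish it through an IORef, so readers (the admin /metrics endpoint)
-- only ever do a cheap readIORef.
startSizeSampler :: Int -> IO Int -> IORef Int -> IO ()
startSizeSampler intervalMicros measure ref = do
  _ <- forkIO loop
  pure ()
  where
    loop = do
      size <- measure            -- the expensive part (e.g. a recursive size walk)
      writeIORef ref size
      threadDelay intervalMicros
      loop

main :: IO ()
main = do
  ref <- newIORef 0
  -- stand-in "measurement" returning a fixed figure, for demonstration
  startSizeSampler 1000000 (pure 4096) ref
  let wait = do
        v <- readIORef ref
        if v == 0 then threadDelay 1000 >> wait else print v
  wait
```

The trade-off is that the metric lags by up to one sampling interval, which is usually acceptable for an order-of-magnitude gauge.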
Force-pushed from 11727ab to 145c236.
Now that I have gotten better at Haskell 🚀, I solved the issue and we don't need an approximation: we can calculate the full cache size in bytes.
Force-pushed from 4b1afd1 to d9de43a.
The load test and memory test are failing with:
Is there any additional configuration that needs to be added for profiling?
nix/overlays/haskell-packages.nix (outdated):

```nix
# nixpkgs have ghc-datasize-0.2.7 marked as broken
ghc-datasize = lib.markUnbroken prev.ghc-datasize;
```
Suggested change:

```nix
# TODO: Remove this once https://github.com/NixOS/nixpkgs/pull/375121
# has made it to us.
ghc-datasize = lib.markUnbroken prev.ghc-datasize;
```
Does it happen locally, too? I am confused; I can't exactly spot what's wrong right now.
Yes, it does happen locally.
Force-pushed from 7ff02db to 6eb7d3c.
Need some way to run
Yeah, running the loadtest in CI doesn't work when dependencies are changed, because it needs to run against the base branch, which doesn't have the dependencies yet... and then cabal just breaks somehow. You can run something like this locally to get the same markdown output:
(but it's likely that it fails the same way... :D)
The thing I don't understand about this error message is that it appears in contexts where we don't use profiling libs. We do use them for the memory test, so there the error message makes "kind of" sense (I still don't understand why it happens, though). But the loadtest and the regular build on macOS don't use profiling. Hm... actually, we don't do a regular dynamically linked Linux build via nix, I think. So in fact it fails for every nix build except the static build. Still don't know what's happening, though.
I get a similar error when trying this locally, too.
To unblock the PR, how about adding the dependency in a separate PR and merging that first? Then I assume this PR would run the loadtest?
No, I don't think so. It seems the dependency issue was fixed a while ago in 0c5d2e5. Also, the fact that all nix builds fail (the memory test, on darwin, etc.) indicates that something else is going on.
Force-pushed from b5331af to de80a9e.
Force-pushed from de80a9e to 7034d38.
Great work 🚀
The codecov failures look related to Internal.hs, which should go away once #3881 is solved.
Force-pushed from 7034d38 to df3b557.
src/PostgREST/Auth.hs (outdated):

```diff
 -- if token not found, add to cache and increment cache size metric
 case (authResult,checkCache) of
-  (Right res, Nothing) -> C.insert' (getJwtCache appState) (getTimeSpec res maxLifetime utc) token res
+  (Right res, Nothing) -> do
+    let tSpec = getTimeSpec res maxLifetime utc
+    C.insert' jwtCache (Just tSpec) token res
+    entrySize <- calcCacheEntrySizeInBytes (token, res, tSpec)
+    observer $ JWTCache entrySize -- adds size to metric
```
I still don't understand how the incremental calculation is done. Where is the JWT cache size stored and then incremented? My thinking is that the cache size should be stored somewhere in AppState, but nothing has been added there.
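For illustration, the kind of incremental accounting being asked about could look like this (a sketch under assumed names; PostgREST's actual metrics go through its Observation/Metrics machinery, so the details differ): a running byte total lives next to the cache, and each new entry's size is added to it:

```haskell
import Data.IORef (IORef, newIORef, readIORef, modifyIORef')

-- Hypothetical AppState fragment: the JWT cache byte total lives in an
-- IORef next to the cache itself; the observer adds each entry's size.
newtype JwtCacheSize = JwtCacheSize (IORef Int)

newJwtCacheSize :: IO JwtCacheSize
newJwtCacheSize = JwtCacheSize <$> newIORef 0

-- Called once per cache insert with the freshly computed entry size.
addEntryBytes :: JwtCacheSize -> Int -> IO ()
addEntryBytes (JwtCacheSize ref) n = modifyIORef' ref (+ n)

-- What the admin /metrics endpoint would read to render the gauge.
readBytes :: JwtCacheSize -> IO Int
readBytes (JwtCacheSize ref) = readIORef ref

main :: IO ()
main = do
  g <- newJwtCacheSize
  mapM_ (addEntryBytes g) [72, 72, 72]  -- three cached JWTs of 72 bytes each
  readBytes g >>= print
```

This matches the transcript earlier in the thread, where the gauge grows by 72 per cached token; the open question in the review is where such state should live (AppState vs. the metrics registry).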
Wait, I think this PR isn't quite ready yet. I am now thinking of a better design. I'll take another look tomorrow.
@steve-chavez I think we need a design decision here. We currently allow our in-memory JWT cache to grow infinitely, with no upper bound. I think this is bad because it will cause memory exhaustion and hurt performance. We also don't need the explicit expiration support that the current cache provides.
Switch to LRU cache: as you said earlier that we need an LRU cache for caching to be production ready, how about we switch our cache implementation to https://hackage.haskell.org/package/lrucache? This would also mean we need a potentially configurable maximum cache size (add a new config option?). WDYT?
@taimoorzaeem Yes, that is the end goal. However, wouldn't that imply a breaking change? It would require a new major version. Since the current JWT cache is broken (#3788), we should release a new minor with the fix and then do the LRU cache for the new major.
We can't release a new minor version. We can only release a new patch on the v12 branch or a new major from the main branch. We still have a few breaking changes that we've been carrying on main for quite a while (removal of PostgreSQL), and we have never made a release for v13.
How should we handle this in v12, then? Maybe put a "danger" note on https://postgrest.org/en/v12/references/auth.html#jwt-caching, saying that the feature is broken and should only be used on v13?
Can we not run purgeExpired periodically? Why do we need this feature for it?
There's no way to test that the fix is working correctly without this. Check https://github.com/PostgREST/postgrest/pull/3801/files#r1855025853; there the feature was added in the same PR as the fix.
Why is the above the case, btw? We could just not pick the breaking changes for the minor.
If we can use this feature to test, on the main branch, that a separate commit fixes the issue, we might just backport the fix without the test.
Partly because that's the workflow we agreed on. We are backporting fixes, not features. But there's a reason behind it:
I fully agree we need to fix the bug. But we should not introduce new risks by doing so.
Agree, thanks for the reminder.
Sounds good. I'll just cherry-pick this PR's commit locally while reviewing #3801, then. @taimoorzaeem I suggest doing the LRU cache improvement after #3801 is merged; we should clear that first.
Sure, sounds good! 👍
Force-pushed from df3b557 to 1d88308.
Add metric pgrst_jwt_cache_size in the admin server, which shows the cache size in bytes.